20 research outputs found

    Hybrid scalar/vector implementations of Keccak and SPHINCS+ on AArch64

    Get PDF
    This paper presents two new techniques for the fast implementation of the Keccak permutation on the A-profile of the Arm architecture: First, the elimination of explicit rotations in the Keccak permutation through Barrel shifting, applicable to scalar AArch64 implementations of Keccak-f1600. Second, the construction of hybrid implementations concurrently leveraging both the scalar and the Neon instruction sets of AArch64. The resulting performance improvements are demonstrated in the example of the hash-based signature scheme SPHINCS+, one of the recently announced winners of the NIST post-quantum cryptography project: We achieve up to 1.89× performance improvements compared to the state of the art. Our implementations target the Arm Cortex-{A55,A510,A78,A710,X1,X2} processors common in client devices such as mobile phones

    Memory-Efficient High-Speed Implementation of Kyber on Cortex-M4

    Get PDF
    This paper presents an optimized software implementation of the module-lattice-based key-encapsulation mechanism Kyber for the ARM Cortex-M4 microcontroller. Kyber is one of the round-2 candidates in the NIST post-quantum project. In the center of our work are novel optimization techniques for the number-theoretic transform (NTT) inside Kyber, which make very efficient use of the computational power offered by the “vector” DSP instructions of the target architecture. We also present results for the recently updated parameter sets of Kyber which equally benefit from our optimizations. As a result of our efforts we present software that is 18% faster than an earlier implementation of Kyber optimized for the Cortex-M4 by the Kyber submitters. Our NTT is more than twice as fast as the NTT in that software. Our software runs at about the same speed as the latest speed-optimized implementation of the other module-lattice based round-2 NIST PQC candidate Saber. However, for our Kyber software, this performance is achieved with a much smaller RAM footprint. Kyber needs less than half of the RAM of what the considerably slower RAM-optimized version of Saber uses. Our software does not make use of any secret-dependent branches or memory access and thus offers state-of-the-art protection against timing attack

    Fast and Clean: Auditable high-performance assembly via constraint solving

    Get PDF
    Handwritten assembly is a widely used tool in the development of high-performance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice. In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture. We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72 microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability

    Faster Kyber and Dilithium on the Cortex-M4

    Get PDF
    This paper presents faster implementations of the lattice-based schemes Dilithium and Kyber on the Cortex-M4. Dilithium is one of the three signature finalists in the NIST post-quantum project (NIST PQC), while Kyber is one of the four key-encapsulation mechanism (KEM) finalists. Our optimizations affect the core polynomial arithmetic using the number-theoretic transform (NTT) of both schemes. Our main contributions are threefold: We present a faster signed Barrett reduction for Kyber, propose to switch to a smaller prime modulus for the polynomial multiplications cs1c\mathbf{s}_1 and cs2c\mathbf{s}_2 in the signing procedure of Dilithium, and apply various known optimizations to the polynomial arithmetic in both schemes. Using a smaller prime modulus is particularly interesting as it allows using the Fermat number transform resulting in especially fast code. We outperform the state-of-the-art for both Dilithium and Kyber. For Dilithium, our NTT and iNTT are faster by 5.2% and 5.7%. Switching to a smaller modulus results in speed-up of 33.1%-37.6% for the relevant operations (sum of basemul and iNTT) in the signing procedure. For Kyber, the optimizations results in 15.9%-17.8% faster matrix-vector product which presents the core arithmetic operation in Kyber

    Practical Fault Injection Attacks on SPHINCS

    Get PDF
    The majority of currently deployed cryptographic public-key schemes are at risk of becoming insecure once large scale quantum computers become practical. Therefore, substitutes resistant to quantum attacks楊nown as post-quantum cryptography預re required. In particular, hash-based signature schemes appear to be the most conservative choice for post-quantum digital signatures. In this work, we mount the first practical fault attack against hash-based cryptography. The attack was originally proposed by Castelnovi, Martinelli, and Prest [9] and allows the creation of a universal signature forgery that applies to all current standardisation candidates (XMSS, LMS, SPHINCS+, and Gravity-SPHINCS). We perform the attack on an Arduino Due board featuring an ARM Cortex-M3 microprocessor running the original stateless scheme SPHINCS with a focus on practicality. We describe how the attack is mountable with a simple voltage glitch injection on the targeted platform, which allowed us to collect enough faulty signatures to create a universal forgery within seconds. As the attack also applies to stateful schemes, we show how caching one-time signatures can entirely prevent the attack for stateful schemes, such as XMSS and LMS. However, we discuss how protecting stateless schemes, like SPHINCS, SPHINCS+, and Gravity-SPHINCS, is more challenging, as this countermeasure does not apply as efficiently as in stateful schemes

    Nibbling MAYO: Optimized Implementations for AVX2 and Cortex-M4

    Get PDF
    MAYO is a popular high-calorie condiment as well as an auspicious candidate in the ongoing NIST competition for additional post-quantum signature schemes achieving competitive signature and public key sizes. In this work, we present high-speed implementations of MAYO using the AVX2 and Armv7E-M instruction sets targeting recent x86 platforms and the Arm Cortex-M4. Moreover, the main contribution of our work is showing that MAYO can be even faster when switching from a bitsliced representation of keys to a nibble-sliced representation. While the bitsliced representation was primarily motivated by faster arithmetic on microcontrollers, we show that it is not necessary for achieving high performance on Cortex-M4. On Cortex-M4, we instead propose to implement the large matrix multiplications of MAYO using the Method of the Four Russians (M4R), which allows us to achieve better performance than when using the bitsliced approach. This results in up to 21% faster signing. For AVX2, the change in representation allows us to implement the arithmetic much faster using shuffle instructions. Signing takes up to 3.2x fewer cycles and key generation and verification enjoy similar speedups. This shows that MAYO is competitive with lattice-based signature schemes on x86 CPUs, and a factor of 2-6 slower than lattice-based signature schemes on Cortex-M4 (which can still be considered competitive)

    Efficient Multiplication of Somewhat Small Integers using Number-Theoretic Transforms

    Get PDF
    Conventional wisdom purports that FFT-based integer multiplication methods (such as the Schönhage-Strassen algorithm) begin to compete with Karatsuba and Toom-Cook only for integers of several tens of thousands of bits. In this work, we challenge this belief, leveraging recent advances in the implementation of number-theoretic transforms (NTT) stimulated by their use in post-quantum cryptography. We report on implementations of NTT-based integer arithmetic on two Arm Cortex-M CPUs on opposite ends of the performance spectrum: Cortex-M3 and Cortex-M55. Our results indicate that NTT-based multiplication is capable of outperforming the big-number arithmetic implementations of popular embedded cryptography libraries for integers as small as 2048 bits. To provide a realistic case study, we benchmark implementations of the RSA encryption and decryption operations. Our cycle counts on Cortex-M55 are about 10× lower than on Cortex-M3

    First-Order Masked Kyber on ARM Cortex-M4

    Get PDF
    In this work, we present a fast and first-order secure Kyber implementation optimized for ARM Cortex-M4. Most notably, to our knowledge this is the first liberally-licensed open-source Cortex-M4 implementation of masked Kyber. The ongoing NIST standardization process for post-quantum cryptography and newly proposed side-channel attacks have increased the demand for side-channel analysis and countermeasures for the finalists. On the foundation of the commonly used PQM4 project, we make use of the previously presented optimizations for Kyber on a Cortex-M4 and further combine different ideas from various recent works to achieve a better performance and improve the security in comparison to the original implementations. We show our performance results for first-order secure implementations. Our masked Kyber768 decapsulation on the ARM Cortex-M4 requires only 2 978 441 cycles, including randomness generation from the internal RNG. We then practically verify our implementation by using the t-test methodology with 100 000 traces

    NTT Multiplication for NTT-unfriendly Rings

    Get PDF
    In this paper, we show how multiplication for polynomial rings used in the NIST PQC finalists Saber and NTRU can be efficiently implemented using the Number-theoretic transform (NTT). We obtain superior performance compared to the previous state of the art implementations using Toom–Cook multiplication on both NIST’s primary software optimization targets AVX2 and Cortex-M4. Interestingly, these two platforms require different approaches: On the Cortex-M4, we use 32-bit NTT-based polynomial multiplication, while on Intel we use two 16-bit NTT-based polynomial multiplications and combine the products using the Chinese Remainder Theorem (CRT). For Saber, the performance gain is particularly pronounced. On Cortex-M4, the Saber NTT-based matrix-vector multiplication is 61% faster than the Toom-Cook multiplication resulting in 22% fewer cycles for Saber encapsulation. For NTRU, the speed-up is less impressive, but still NTT-based multiplication performs better than Toom–Cook for all parameter sets on Cortex-M4. The NTT-based polynomial multiplication for NTRU-HRSS is 10% faster than Toom–Cook which results in a 6% cost reduction for encapsulation. On AVX2, we obtain speed-ups for three out of four NTRU parameter sets. As a further illustration, we also include code for AVX2 and Cortex-M4 for the Chinese Association for Cryptologic Research competition award winner LAC (also a NIST round 2 candidate) which outperforms existing code

    Is Java Card ready for hash-based signatures?

    Get PDF
    The current Java Card platform does not seem to allow for fast implementations of hash-based signature schemes. While the underlying implementation of the cryptographic primitives provided by the API can be fast, thanks to implementations in native code or in hardware, the cumulative overhead of the many separate API calls results in prohibitive performance for many common applications. In this work, we present an implementation of XMSSMT^{MT} on the current Java Card platform, and make suggestions how to improve this platform in future versions
    corecore